An Algorithmic Framework for Compression and Text Indexing

Authors

  • Roberto Grossi
  • Ankur Gupta
  • Jeffrey Scott Vitter
Abstract

We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text $T$ of $n$ symbols drawn from an alphabet $\Sigma$, our bounds are stated in terms of the $h$th-order empirical entropy of the text, $H_h$. In particular, we provide a tight analysis of the Burrows-Wheeler transform (BWT), establishing a bound of $nH_h + M(T,\Sigma,h)$ bits, where $M(T,\Sigma,h)$ denotes the asymptotic number of bits required to store the empirical statistical model for contexts of order $h$ appearing in $T$. Using the same framework, we also obtain an implementation of the compressed suffix array (CSA) which achieves $nH_h + M(T,\Sigma,h) + O(n \lg\lg n / \lg_{|\Sigma|} n)$ bits of space while still retaining competitive full-text indexing functionality. The novelty of the proposed framework lies in its use of the finite set model instead of the empirical probability model (as in previous work), giving us new insight into the design and analysis of our algorithms. For example, we show that our analysis gives improved bounds since $M(T,\Sigma,h) \le \min\{g'_h \lg(n/g'_h + 1),\; H^*_h n + \lg n + g''_h\}$, where $g'_h = O(|\Sigma|^{h+1})$ and $g''_h = O(|\Sigma|^{h+1} \lg |\Sigma|^{h+1})$ do not depend on the text length $n$, while $H^*_h \ge H_h$ is the modified $h$th-order empirical entropy of $T$. Moreover, we show a strong relationship between a compressed full-text index and the succinct dictionary problem. We also examine the importance of lower-order terms, as these can dwarf any savings achieved by high-order entropy. We report further results and tradeoffs on high-order entropy-compressed text indexes in the paper.
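As a rough illustration of the two quantities the abstract builds on, the sketch below computes an hth-order empirical entropy and a Burrows-Wheeler transform for a short string. It is not the authors' method: the function names `empirical_entropy_h` and `naive_bwt` are hypothetical, the context of a symbol is assumed to be the h symbols preceding it (the paper may use the opposite direction), the first h symbols are simply left out of the sum, and the transform is built from sorted rotations with an assumed `$` sentinel, which is quadratic and only suitable for small inputs.

```python
import math
from collections import Counter, defaultdict


def empirical_entropy_h(text: str, h: int) -> float:
    """Return an hth-order empirical entropy of `text`, in bits per symbol.

    Assumption: the context of text[i] is the h symbols before it, and the
    first h symbols (which lack a full context) are not counted.
    """
    n = len(text)
    if n == 0:
        return 0.0
    # Group every symbol under the h-symbol context that precedes it.
    followers = defaultdict(Counter)
    for i in range(h, n):
        followers[text[i - h:i]][text[i]] += 1
    total_bits = 0.0
    for counts in followers.values():
        size = sum(counts.values())
        # Zeroth-order entropy of this context's followers, times its size.
        total_bits += sum(c * math.log2(size / c) for c in counts.values())
    return total_bits / n


def naive_bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + sentinel  # the sentinel is assumed to sort before every symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)
```

For instance, `naive_bwt("banana")` gives `"annb$aa"`, and for the text `"abab"` the zeroth-order entropy is 1 bit per symbol while the first-order entropy is 0, since each symbol is fully determined by the one before it. This gap is the kind of saving that bounds of the form $nH_h$ capture.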

Similar resources

Reflections on Image Indexing: A Picture Is Worth a Thousand Words

Purpose: This paper presents various image indexing techniques and discusses their advantages and limitations. Methodology: Through a review of the literature, it identifies three main image indexing techniques, namely concept-based image indexing, content-based image indexing, and folksonomy. It then describes each technique. Findings: Concept-based image indexing is te...

An Effective Approach for Compression of Bengali Text

In this paper, we propose an effective and efficient approach for compressing Bengali Text. This paper focuses on a methodical study on Bengali text compression techniques. The main target of this research is to provide a framework for Bengali text compression; which ensures a simple and computationally inexpensive effective scheme for Bengali text compression. The proposed Bengali text compres...

An In-Memory XQuery/XPath Engine over a Compressed Structured Text Representation

We describe the architecture and main algorithmic design decisions for an XQuery/XPath processing engine over XML collections which will be represented using a self-indexing approach, that is, a compressed representation that will allow for basic searching and navigational operations in compressed form. The goal is a structure that occupies little space and thus permits manipulating large colle...

A Comparing between the impacts of text based indexing and folksonomy on ranking of images search via Google search engine

Background and Aim: The purpose of this study was to compare the impact of text-based indexing and folksonomy on image retrieval via the Google search engine. Methods: This study used an experimental method. The sample is 30 images extracted from the book “Gray anatomy”. The research was carried out in 4 stages; in the first stage, images were uploaded to an “Instagram” account so the images are tagge...

Compression-Domain Text Indexing and Retrieval

Keyword-based text retrieval engines have been and will continue to be essential to text-based information access systems because they serve as the basic building blocks to high-level text analysis systems. Traditionally, text compression and text retrieval are treated as independent problems. Text files are compressed and indexed separately. To answer a keyword-based query, text files are first uncom...

Publication date: 2005